Human resources are the most valuable asset in any country¹. They are the main reason behind the success or the failure of any organization. In fact, having an educated and competent manpower is the key driver to economic and social development. In this context, the importance of academic education has become undeniable. Therefore, it is crucial to invest money and time in order to study students’ academic performance and figure out effective ways to improve it.
Given the importance of the topic, it has been given particular attention in past research. In fact, many studies have been conducted in order to analyze the factors impacting students’ academic performance. While some studies focused on the psychological variables, such as Franck Amadieu & André Tricot’s research², other researchers have been interested in the impact of other elements such as mobility ³, gender and other socio-economic factors on students’ academic success.
Many reasons motivated us to choose this topic of research. In fact, as students, we are very passionate about the educational field. Thus, we want to provide through this project a detailed analysis that can be used as a reference guide for leaders working in the educational field. Mainly, we want to help schools and universities to have a better understanding of the factors influencing students’ academic performance in order to improve their decision-making processes, students’ success rate and eventually their overall organization.
Source: ¹ Gestion des ressources humaines,Jean-Marie Peretti, 2004. ² Psychological factors which have an effect on student success ,2015. ³ La migration pour études : Regards d’intervenants sur l’accueil et l’intégration des nouveaux étudiants »,2009.
The aim of the project is to understand the evolution of secondary academic performance in France. Our study will mainly focus on 3rd grade students (equivalent to 11th grade in Switzerland) and their results on the Diplôme National du Brevet (DNB) by school.
First, we will observe whether there are improvements or, on the contrary, deterioration in admissions of DNB over the years. From this dataset, we will also make comparisons, particularly at the geographical level, and an analysis of the success rate in terms of distinction for each school.
Then, we will try to understand if there is a correlation between academic success and some socio-economic factors, such as the type of accommodation, the single-parent families rate, and the involvement of schools in students’ physical and sports practice. Finally, despite these factors, we will investigate whether the COVID-19 pandemic has had a direct negative impact on students’ school performance.
What is the evolution of student performance over time and across the different regions/departments of France?
Do socio-economic factors such as the type of accommodation, family situation or college policies have an influence on student success ?
Has the COVID-19 pandemic impacted student performance?
Import data
We have realised that some wrangling are necessary for each data sets. We have established a checklist that we will go through for each data set. We have to :
As we will rename all data sets in the same way, we found it useful to create a function. The function has a data frame and a vector as inputs. It checks if the length of the vector is correct, if so it returns a tibble with the column names renamed
rename_df <- function(df, x){
if (ncol(df) == length(x)){
names(df) <- c(x)
df <- as_tibble(df)
} else {
stop("Vector is not the right length")
}
}
This dataset presents the results of the diplôme national du brevet by school, for schools in metropolitan France and for the overseas departments and regions. This data set contains 139’580 observations.
dnb_colnames <- c("session", "school_id", "school_type", "establishment_name", "education_sector", "municipality_code", "municipality", "department_code", "department", "academy_code", "academy_name", "region_code", "region", "registered", "present", "admitted", "admitted_without", "admitted_AB", "admitted_B", "admitted_TB", "success_rate_pct"
)
dnb_results <- rename_df(DNB_par_etablissement, dnb_colnames)
dnb_results[["success_rate_pct"]] <- as.double(gsub("%","",
gsub(",",".", dnb_results[["success_rate_pct"]])))
We need to harmonize the year variables of all the other data sets to match the logic of this one. The year is the year of the exam session (e.g academic period “2020-2021” is represented as 2021)
We need to add a column department_fr that with department names matching the ones in ggplot.
dnb_results$department_fr <- stri_trans_general(dnb_results$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
dnb_results <- dnb_results %>%
mutate(without_pct = admitted_without/admitted*100,
AB_pct = admitted_AB/admitted*100,
B_pct = admitted_B/admitted*100,
TB_pct = admitted_TB/admitted*100
)
We can see the final table dnb_results below.
This data set gathers all schools which have been awarded the “Generation 2024” label. Its aim is to develop bridges between the school world and the sports movement in order to encourage young people to practice physical activity and sport. This data set contains 6’883 observations.
est_24_names <- c("region", "academy", "department", "municipality", "establishment", "school_id", "school_type", "education_sector", "postcode", "adress", "adress_2", "mail", "students", "priority_education", "city_school", "QPV", "ULIS", "SEGPA", "sport_section", "agricultural_high_school", "military_high_school", "vocational_high_school", "establishment_web", "SIREN_SIRET", "district", "ministry", "label_start_date", "label_end_date", "y_coordinate", "x_coordinate", "epsg", "precision_on_localisation", "latitude", "longitude", "position", "engaging_30_sport")
establishment_24 <- rename_df(Etablissements_labellises_generation_2024, est_24_names)
establishment_24 <- establishment_24 %>%
mutate(session_started = case_when(month(label_start_date) <= 7 ~ year(label_start_date),
month(label_start_date) > 7 ~ year(label_start_date)+1),
session_ended = case_when(month(label_end_date) <= 7 ~ year(label_end_date),
month(label_end_date) > 7 ~ year(label_end_date)+1)
)
establishment_24$department_fr <- stri_trans_general(establishment_24$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
establishment_24 <- establishment_24 %>%
mutate(high_school_type = case_when(agricultural_high_school == 1 ~ "agricultural high school",
military_high_school == 1 ~ "military high school",
vocational_high_school == 1 ~ "vocational high school")) %>%
select(region:sport_section, high_school_type, establishment_web:session_ended)
We can see the final table establishment_24 below.
This dataset records enrolment in secondary schools according to the type of accommodation for pupils: half-board, boarding school etc. This data set contains 32’096 observations.
housing_names <- c("year_back_to_school", "Academic_region", "academy", "department", "municipality", "school_id", "establishment_main_name", "establishment_name", "school_type", "education_sector", "students_secondary_education", "students_higher_education", "external_students_secondary_education", "half_boarders_students_secondary_education", "boarding_students_secondary_education", "external_students_higher_education", "half_board_students_higher_education", "boarding_students_higher_education")
student_housing <- rename_df(Hebergement_eleves_etablissements_2d, housing_names)
No Problem for this data set
We need to create a session variable as year_back_to_school refers to the beginning of the school year and not the exam session.
student_housing <- student_housing %>%
mutate(session = year_back_to_school + 1) %>%
select(year_back_to_school,session, everything()) #here just to order variables
student_housing$department_fr <- stri_trans_general(student_housing$department, "Latin-ASCII") %>%
str_to_title(.) %>%
gsub("Du", "du", .) %>%
gsub("De", "de", .) %>%
gsub("D'", "D", .) %>%
gsub("Et", "et", .) %>%
gsub(" ", "-", .) %>%
str_replace_all("Corse-du-Sud", "Corse du Sud") %>%
str_replace_all("deux-Sevres", "Deux-Sevres") %>%
str_replace_all("Alpes-de-Hte-Provence", "Alpes-de-Haute-Provence") %>%
str_replace_all("Territoire-de-Belfort", "Territoire de Belfort") %>%
str_replace_all("Seine-Saint-denis", "Seine-Saint-Denis")
We can see the final table student_housing below.
This data set provides information about the single-parent families in each municipality. The census has been made every five years since 2008. This data set contains 104’986 observations
sg_parent_names <- c("geocode", "municipality", "year","sing_par")
This is a time based data set that gives us information on the COVID tests and results carried out by laboratories, hospitals, pharmacists, doctors and nurses. It is updated daily. On the 11th October, the data set contained 543’974 observations.
Translate the column names.
can try to implement API (further idea)
#1
covide_names <- c("department_code", "test_week", "educational_level", "age_group", "pop", "positive", "tested", "incidence_rate", "positivity_rate", "screening_rate")
covid_in_schools <- rename_df(Covid_sp_dep_7j_cage_scol_2022_10_10_19h02,covide_names)
covid_in_schools[["positive"]] <- as.double(gsub(",",".", covid_in_schools[["positive"]]))
covid_in_schools[["incidence_rate"]] <- as.double(gsub(",",".", covid_in_schools[["incidence_rate"]]))
covid_in_schools[["positivity_rate"]] <- as.double(gsub(",",".", covid_in_schools[["positivity_rate"]]))
covid_in_schools <- covid_in_schools %>%
mutate(test_date = ymd(substr(test_week,1,10)),
session = case_when(month(test_date) <= 7 ~ year(test_date),
month(test_date) > 7 ~ year(test_date)+1))
Only department code. We need to input the names of the departments.
To simplify the dataset, we need to drop all but 11-15 educational_level.
We can see the final table covid_in_schools below.
We will use the ggplot France map for our visualizations
map <- map_data("france")
The region variable is in fact the departments. A quick renaming is necessary.
colnames(map)[5]<- "department_fr"
We are mostly exploring our options and have yet to truly deep dive into this portion of our project.
##National analysis
National DNB statistics
National DNB statistics (rate)
Number of admitted by region
###{-}
###{-}
First analysis of the best performing highest rate of TB -> Paris 2020
Lowest performing department in TB rate and “best” in zero mention rate -> Guyane 2006
Creation of the map theme
map_theme <- theme(title=element_text(),
plot.title=element_text(margin=margin(20,20,20,20), size=18, hjust = 0.5),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.grid.major= element_blank(),
panel.background= element_blank())
Creation of the dataset used for the map
result <- dnb_results %>%
select(department_fr, success_rate_pct) %>%
group_by(department_fr) %>%
summarise(mean = mean(success_rate_pct))
Join the map from ggplot and our new dataset
result_map <- left_join(x = map[,-6], y = result)
plot
regression Essai
lm1 <- lm(dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct + dnb_results$AB_pct + dnb_results$without_pct)
summary(lm1)
#>
#> Call:
#> lm(formula = dnb_results$TB_pct ~ dnb_results$without_pct + dnb_results$B_pct +
#> dnb_results$AB_pct + dnb_results$without_pct)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -2.30e-09 0.00e+00 0.00e+00 0.00e+00 1.53e-10
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.00e+02 1.53e-13 6.54e+14 <2e-16 ***
#> dnb_results$without_pct -1.00e+00 1.63e-15 -6.12e+14 <2e-16 ***
#> dnb_results$B_pct -1.00e+00 2.93e-15 -3.41e+14 <2e-16 ***
#> dnb_results$AB_pct -1.00e+00 1.96e-15 -5.11e+14 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.25e-12 on 139537 degrees of freedom
#> (39 observations deleted due to missingness)
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.79e+29 on 3 and 139537 DF, p-value: <2e-16
cluster Essai
dnb_pct_dep <- dnb_results %>%
group_by(department, session) %>%
summarise(AB_pct_dep = mean(AB_pct, na.rm = TRUE),
B_pct_dep = mean(B_pct, na.rm = TRUE),
TB_pct_dep = mean(TB_pct, na.rm = TRUE),
without_pct_dep = mean(without_pct, na.rm = TRUE),
success_rate_pct_dep = mean(success_rate_pct, na.rm = TRUE))
pairs(dnb_pct_dep[2:6])
distance <- dist(dnb_pct_dep)
#> Warning in dist(dnb_pct_dep): NAs introduced by coercion
mydata.hclust <- hclust(distance)
plot(mydata.hclust)